Word Segmentation in the Spoken Dutch Corpus

Authors

Jean-Pierre Martens

Diana Binnenpoorte

Kris Demuynck

Ruben Van Parys

Tom Laureys

Wim Goedertier

Jacques Duchateau

ELIS, University of Ghent, Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium
martens,odul,rvparijs @elis.rug.ac.be

Dept Language & Speech, University of Nijmegen, P.O. Box 9103, 6500 HD Nijmegen, The Netherlands
[email protected]

ESAT, K.U.Leuven, Kasteelpark Arenberg 10, B-3001 Heverlee, Belgium
kris.demuynck,tom.laureys,jacques.duchateau @esat.kuleuven.ac.be

Abstract

This paper describes the aims of the word segmentation in the Spoken Dutch Corpus (Corpus Gesproken Nederlands, CGN) and the procedures used to create it. For one million words, a manually verified segmentation will be created, whereas the remaining nine million words will come only with an automatically generated segmentation. We describe our efforts to create the best possible automatic word segmentation from an auditorily verified phonetic transcription, and the development of a protocol for the manual verification of that automatic segmentation. The paper also reports some figures concerning the manual verification of the first hundred thousand words.

Related resources

Assessing Segmentations: Two Methods for Confidence Scoring Automatic HMM-Based Word Segmentations

The Dutch-Flemish project Spoken Dutch Corpus (1998-2003) aims at the development of an annotated corpus of 10 million spoken words. In order to make the speech data easily accessible, a word segmentation couples the orthographic transcription to the speech signal by means of time stamps. Generally, such segmentations are produced manually. Since this manual procedure is a time-consuming effort...

Building a corpus of spoken Dutch

In this paper the Spoken Dutch Corpus Project is presented, a joint Flemish-Dutch undertaking aimed at the compilation and annotation of a 10-million-word corpus of spoken Dutch. Upon completion, the corpus will constitute a valuable resource for research in the fields of computational linguistics and language and speech technology. The paper first gives an overview of the project. It then goes ...

Automatic Phonemic Labeling and Segmentation of Spoken Dutch

The CGN corpus (Oostdijk, 2000) (Corpus Gesproken Nederlands/Corpus Spoken Dutch) is a large speech corpus of contemporary Dutch as spoken in Belgium (3.3 million words) and in the Netherlands (5.6 million words). Due to its size, manual phonemic annotation was limited to 10% of the data and automatic systems were used to complement this data. This paper describes the automatic generation of th...

Harvesting Dutch Trees: Syntactic Properties of Spoken Dutch

In this paper, we report on quantitative research into certain word order phenomena in Dutch. In our research, we use the Spoken Dutch Corpus (CGN), a major new resource for research into contemporary spoken Dutch. After briefly introducing the primary data, the annotations added, and some of the tools to explore the primary data and the annotations, we illustrate how the Corpus may be utilized...

The Spoken Dutch Corpus. Overview and First Evaluation

In this paper the Spoken Dutch Corpus project is presented, a joint Flemish-Dutch undertaking aimed at the compilation and annotation of a 10-million-word corpus of spoken Dutch. Upon completion, the corpus will constitute a valuable resource for research in the fields of computational linguistics and language and speech technology. The paper first gives an overall description of the project, i...

Publication date: 2002
